
    Design and Field Test of a WSN Platform Prototype for Long-Term Environmental Monitoring

    Long-term wildfire monitoring using distributed in situ temperature sensors is an accurate yet demanding environmental monitoring application, which requires long-life, low-maintenance, low-cost sensors and a simple, fast, error-proof deployment procedure. In this paper we present the most important design considerations and optimizations of all elements of a low-cost WSN platform prototype for long-term, low-maintenance pervasive wildfire monitoring, its preparation for a nearly three-month field test, the analysis of the causes of failure during the test, and the lessons learned for platform improvement. The main components of the total cost of the platform (nodes, deployment, and maintenance) are carefully analyzed and optimized for this application. The gateways are designed to operate with resources generally used for sensor nodes, while the requirements and cost of the sensor nodes are significantly lower. We define and test, in simulation and in the field experiment, a simple but effective communication protocol for this application. It helps lower the cost of the nodes and of the field deployment procedure, while extending the theoretical lifetime of the sensor nodes to over 16 years on a single 1 Ah lithium battery.
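
    As a rough sanity check of such lifetime claims, a back-of-the-envelope average-current model is sketched below; the sleep/active currents and duty cycle are illustrative assumptions, not values from the paper.

```python
# Back-of-envelope node lifetime model (all current/timing values are
# illustrative assumptions, not figures from the paper).

BATTERY_MAH = 1000.0     # 1 Ah lithium cell
SLEEP_UA = 2.0           # deep-sleep current (uA), assumed
ACTIVE_MA = 15.0         # radio/MCU active current (mA), assumed
ACTIVE_S_PER_HOUR = 0.5  # active time per hour (s), assumed

def lifetime_years(battery_mah, sleep_ua, active_ma, active_s_per_hour):
    """Battery lifetime estimate from the duty-cycled average current."""
    duty = active_s_per_hour / 3600.0
    avg_ma = active_ma * duty + (sleep_ua / 1000.0) * (1.0 - duty)
    hours = battery_mah / avg_ma
    return hours / (24 * 365)

print(f"{lifetime_years(BATTERY_MAH, SLEEP_UA, ACTIVE_MA, ACTIVE_S_PER_HOUR):.1f} years")
```

    With these assumed figures the model yields roughly 28 years, showing that a multi-decade theoretical lifetime on a 1 Ah cell is plausible for a heavily duty-cycled node.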

    Asynchronous Resilient Wireless Sensor Network for Train Integrity Monitoring

    To increase railway use efficiency, the European Railway Traffic Management System (ERTMS) Level 3 requires all trains to constantly and reliably self-monitor and report their integrity and track position without infrastructure support. Timely train separation detection is challenging, especially for long freight trains whose cars have no electrical power. Data fusion of multiple monitoring techniques is currently being investigated, including distributed integrity sensing of all train couplings. We propose a Wireless Sensor Network (WSN) topology, communication protocol, application, and sensor node prototypes designed for low-power, timely train integrity reporting in unreliable conditions, such as intermittent node operation and network association (e.g., under low environmental energy harvesting conditions) and unreliable radio links. Each train coupling is redundantly monitored by four sensors, which can help satisfy the Train Collision Avoidance System (TCAS) and European Committee for Electrotechnical Standardization (CENELEC) SIL 4 requirements and contributes to the reliability of the asynchronous network with low rejoin overhead. A control center on the locomotive controls the WSN and receives the reports, easing integration into railway or Internet of Things (IoT) applications. Software simulations of the virtually unchanged embedded application code show that the energy-optimized configurations check the integrity of a 50-car train (about 1 km long) in 3.6 s on average with a 0.1 s standard deviation, and that more than 95% of the reports are delivered successfully even with up to one-third of communications or up to 15% of the nodes failed. We also report qualitative test results for a 20-node network in different experimental conditions.
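
    A minimal sketch of the four-sensor redundancy argument follows, assuming independent sensor failures; the failure probabilities are illustrative, and the real analysis also involves the protocol and radio link behavior.

```python
# Minimal sketch of the redundancy argument: each coupling is watched by
# four sensors, so its state is lost only if all four fail. Independent
# failures are assumed; probabilities below are illustrative.

def coupling_covered_prob(node_fail_p: float, redundancy: int = 4) -> float:
    """Probability that at least one of the redundant sensors still reports."""
    return 1.0 - node_fail_p ** redundancy

for p in (0.05, 0.15, 0.30):
    print(f"node failure {p:.0%} -> coupling covered {coupling_covered_prob(p):.6f}")
```

    Even with 15% of nodes failed, under this simplified independence model each coupling remains covered with probability above 0.9994, which is consistent in spirit with the reported delivery rates.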

    Interactive Trace-Based Analysis Toolset for Manual Parallelization of C Programs

    Massive amounts of legacy sequential code need to be parallelized to make better use of modern multiprocessor architectures. Nevertheless, writing parallel programs is still a difficult task. Automated parallelization methods can be effective at the statement and loop levels and, more recently, at the task level, but they are still restricted to specific source code constructs or application domains. In this article we present an innovative toolset that supports developers in manual code analysis and parallelization decisions. It automatically collects and represents the program profile and data dependencies in an interactive graphical format that facilitates the analysis and discovery of manual parallelization opportunities. The toolset can be used for arbitrary sequential C programs and parallelization patterns. Moreover, its program-scope, runtime data dependency tracing can complement tools based on static code analysis and benefit from them at the same time. We also tested the effectiveness of the toolset in terms of the time needed to reach parallelization decisions and of their quality, measuring a significant improvement for several real-world representative applications.
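
    The runtime dependency-tracing idea can be illustrated with a toy read/write-set tracer; the real toolset instruments C programs, whereas this Python sketch with hypothetical task names only shows how read-after-write dependencies between code regions are detected.

```python
# Toy runtime dependency tracer in the spirit described above; task
# names and addresses are hypothetical, for illustration only.

from collections import defaultdict

class Tracer:
    def __init__(self):
        self.last_writer = {}         # address -> task that last wrote it
        self.deps = defaultdict(set)  # task -> tasks it depends on

    def write(self, task, addr):
        self.last_writer[addr] = task

    def read(self, task, addr):
        writer = self.last_writer.get(addr)
        if writer is not None and writer != task:
            self.deps[task].add(writer)  # read-after-write dependency

t = Tracer()
t.write("loop_A", 0x100)
t.read("loop_B", 0x100)  # loop_B consumes data produced by loop_A
print(dict(t.deps))       # {'loop_B': {'loop_A'}}
```

    Two traced regions with no edge between them in the resulting dependency graph are candidates for parallel execution, which is the kind of opportunity the interactive visualization exposes.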

    Enhanced Exploration of Neural Network Models for Indoor Human Monitoring

    Indoor human monitoring can enable or enhance a wide range of applications, from medical to security and home or building automation. For effective ubiquitous deployment, the monitoring system should be easy to install and unobtrusive, reliable, low-cost, tagless, and privacy-aware. Long-range capacitive sensors are good candidates, but they can be susceptible to environmental electromagnetic noise and require special signal processing. Neural networks (NNs), especially 1D convolutional neural networks (1D-CNNs), excel at extracting information and rejecting noise, but they lose important relationships in max/average pooling operations. We investigate the performance of NN architectures for time series analysis that avoid this shortcoming: capsule networks, which use dynamic routing, and temporal convolutional networks (TCNs), which use dilated convolutions to preserve input resolution across layers and extend their receptive field with fewer layers. The networks are optimized for both inference accuracy and resource consumption using two independent state-of-the-art methods, neural architecture search and knowledge distillation. Experimental results show that the TCN architecture performs best, achieving 12.7% lower inference loss with 73.3% less resource consumption than the best 1D-CNN when processing noisy capacitive sensor data for indoor human localization and tracking.
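
    The receptive-field benefit of dilated convolutions mentioned above can be quantified with simple arithmetic; the kernel size and layer counts below are illustrative, not the paper's optimized architecture.

```python
# Receptive-field arithmetic for a TCN whose dilation doubles at every
# layer (1, 2, 4, ...); kernel size and depths are assumed values.

def tcn_receptive_field(kernel_size: int, num_layers: int) -> int:
    """Receptive field of a stack of dilated causal convolutions."""
    rf = 1
    for layer in range(num_layers):
        rf += (kernel_size - 1) * (2 ** layer)
    return rf

for layers in (4, 6, 8):
    print(layers, "layers ->", tcn_receptive_field(3, layers), "samples")
```

    With kernel size 3, eight layers already cover 511 input samples, which is why TCNs can see long histories with few layers and without any pooling.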

    Exact and heuristic allocation of multi-kernel applications to multi-FPGA platforms

    FPGA-based accelerators have demonstrated high energy efficiency compared to GPUs and CPUs, but a single-FPGA design may not achieve sufficient task parallelism. In this work, we optimize the mapping of high-performance multi-kernel applications, such as Convolutional Neural Networks, to multi-FPGA platforms. First, we formulate the system-level optimization problem, choosing within a huge design space the parallelism and the number of compute units for each kernel in the pipeline. Then we solve it using a combination of Geometric Programming, which produces the optimum-performance solution under resource and DRAM bandwidth constraints, and a heuristic allocator of the compute units on the FPGA cluster.
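
    A minimal sketch of such a system-level model as a geometric program is shown below, using cvxpy's GP mode (an assumed tool, not necessarily the authors'); the work and resource figures are illustrative, the compute-unit counts are relaxed to reals, and the DRAM bandwidth constraints and heuristic allocator are left out.

```python
# Sketch of the pipeline allocation as a geometric program (GP).
# cvxpy and its GP mode are an assumption here; the paper's model is
# richer (e.g., DRAM bandwidth constraints).
import cvxpy as cp

work = [8.0, 4.0, 2.0]  # normalized work per pipeline kernel (assumed)
cost = [3.0, 2.0, 1.0]  # resources per compute unit of each kernel (assumed)
BUDGET = 20.0           # total resources across the FPGA cluster (assumed)

p = cp.Variable(len(work), pos=True)  # compute units per kernel, relaxed
t = cp.Variable(pos=True)             # latency bound of the slowest stage

constraints = [work[i] / p[i] <= t for i in range(len(work))]
constraints.append(sum(cost[i] * p[i] for i in range(len(work))) <= BUDGET)

cp.Problem(cp.Minimize(t), constraints).solve(gp=True)
print("stage latency bound:", float(t.value))
print("compute units:", [float(v) for v in p.value])
```

    In this toy instance the optimum equalizes work[i]/p[i] across kernels; rounding the fractional unit counts and placing them on the FPGAs is then the job of a heuristic allocator.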

    Design and Optimization of Residual Neural Network Accelerators for Low-Power FPGAs Using High-Level Synthesis

    Residual neural networks are widely used in computer vision tasks. They enable the construction of deeper and more accurate models by mitigating the vanishing gradient problem. Their main innovation is the residual block, which allows the output of one layer to bypass one or more intermediate layers and be added to the output of a later layer. Their complex structure and the buffering required by the residual block make them difficult to implement on resource-constrained platforms. We present a novel design flow for implementing deep learning models on field-programmable gate arrays, optimized for ResNets and using a strategy to reduce their buffering overhead, to obtain a resource-efficient implementation of the residual layer. Our high-level synthesis (HLS) based flow encompasses a thorough set of design principles and optimization strategies, exploiting in novel ways standard techniques such as temporal reuse and loop merging to efficiently map ResNet models, and potentially other skip-connection-based NN architectures, onto FPGAs. The models are quantized to 8-bit integers for both weights and activations, 16 bits for biases, and 32 bits for accumulations. The experimental results are obtained on the CIFAR-10 dataset with ResNet8 and ResNet20 implemented on the Xilinx Ultra96-V2 and Kria KV260 boards using HLS. Compared to the state of the art on the Kria KV260 board, our ResNet20 implementation achieves a 2.88X speedup with 0.5% higher accuracy (91.3%), while the ResNet8 accuracy improves by 2.8% (to 88.7%). The throughputs of ResNet8 and ResNet20 are 12971 FPS and 3254 FPS on the Ultra96 board, and 30153 FPS and 7601 FPS on the Kria KV260 board, respectively. They Pareto-dominate state-of-the-art solutions in accuracy, throughput, and energy.
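
    The quantization scheme can be illustrated with a small NumPy sketch; the scales, tensor sizes, and data are illustrative assumptions, while the bit widths (8-bit weights/activations, 16-bit biases, 32-bit accumulation) follow the abstract.

```python
# Minimal sketch of the quantized arithmetic described above: int8
# weights/activations, int16 biases, int32 accumulation. Scales and
# sizes are illustrative assumptions.
import numpy as np

def quantize(x: np.ndarray, scale: float) -> np.ndarray:
    """Symmetric linear quantization to int8."""
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

rng = np.random.default_rng(0)
w = quantize(rng.normal(size=(16, 16)), scale=0.02)              # weights
a = quantize(rng.normal(size=16), scale=0.05)                    # activations
b = rng.integers(-2**15, 2**15, size=16, dtype=np.int16)         # biases

# Widen to int32 before the MAC, as in the accelerator's datapath,
# so the dot products cannot overflow.
acc = w.astype(np.int32) @ a.astype(np.int32) + b.astype(np.int32)
print(acc.dtype, acc[:4])
```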

    CUDA-Optimized GPU Acceleration of 3GPP 3D Channel Model Simulations for 5G Network Planning

    Simulation of massive multiple-input multiple-output (MIMO) channel models is becoming increasingly important for testing and validating fifth-generation new radio (5G NR) wireless networks and beyond. However, simulation performance tends to be limited when modeling a large number of antenna elements combined with a complex and realistic representation of propagation conditions. In this paper, we propose an efficient implementation of a 3rd Generation Partnership Project (3GPP) three-dimensional (3D) channel model, specifically designed for graphics processing unit (GPU) platforms, with the goal of minimizing the computational time required for channel simulation. The channel model is highly parameterized to encompass the wide range of configurations required for real-world optimized 5G NR network deployments. We use several compute unified device architecture (CUDA) based optimization techniques to exploit the parallelism and memory hierarchy of the GPU. Experimental data show that the developed system achieves an overall speedup of about 240× compared to the original C++ model executed on an Intel processor. Compared to a design previously accelerated on a datacenter-class field-programmable gate array (FPGA), the GPU design delivers 33.3% higher single-precision performance at the cost of 7.5% higher power consumption. The proposed GPU accelerator can provide fast and accurate channel simulations for 5G NR network planning and optimization.
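
    From the figures in the abstract, the relative performance per watt of the GPU design versus the FPGA one works out as follows.

```python
# Quick arithmetic on the reported GPU-vs-FPGA comparison; the two
# ratios are taken directly from the abstract.
gpu_perf_rel = 1.333   # +33.3% single-precision performance vs FPGA
gpu_power_rel = 1.075  # +7.5% power consumption vs FPGA
print(f"GPU perf/W vs FPGA: {gpu_perf_rel / gpu_power_rel:.2f}x")  # ~1.24x
```

    So the GPU design is not only faster in absolute terms but also about 24% better in performance per watt under these reported figures.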

    Array-specific dataflow caches for high-level synthesis of memory-intensive algorithms on FPGAs

    Designs implemented on field-programmable gate arrays (FPGAs) via high-level synthesis (HLS) suffer from off-chip memory latency and bandwidth bottlenecks. FPGAs can access both large but slow off-chip memories (DRAM) and fast but small on-chip memories (block RAMs and registers). HLS tools allow exploiting the memory hierarchy in a scratchpad-like fashion, but this requires significant manual effort. We propose to automate FPGA memory management in Xilinx Vitis HLS through a fully configurable C++ source-level cache. Each DRAM-mapped array can be associated with a private level 2 (L2) cache with one or more ports, and each port can optionally provide a level 1 (L1) cache. The L2 cache runs in a separate dataflow task with respect to the application accessing it. This solution isolates off-chip memory accesses and data buffering into dedicated dataflow tasks, resembling the load-compute-store design paradigm but without the drawback of manual algorithm refactoring. Experimental results collected from an FPGA board show that our cache speeds up the execution of a variety of benchmarks by up to 60 times compared to the out-of-the-box solution provided by HLS, with very limited optimization effort. Our caches are not meant to compete with manually optimized implementations in quality of results (QoR), but rather to trade some QoR for significantly lower design effort, making the HLS flow a bit more software-like and allowing the designer to focus on algorithmic optimizations rather than on explicit memory management. Moreover, caching may be the only feasible memory optimization for algorithms with data-dependent or irregular memory access patterns but good data locality.
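
    Why such a cache helps workloads with good data locality can be seen with a toy direct-mapped cache model; the real cache is a C++ template running as an HLS dataflow task, and the line and size parameters here are assumptions for illustration.

```python
# Toy direct-mapped cache model, only to illustrate why data locality
# makes a source-level cache effective; parameters are assumed.

LINE_WORDS = 16  # words per cache line
NUM_LINES = 64   # lines in the cache

def hit_rate(addresses):
    tags = [None] * NUM_LINES
    hits = 0
    for a in addresses:
        line = a // LINE_WORDS
        idx, tag = line % NUM_LINES, line // NUM_LINES
        if tags[idx] == tag:
            hits += 1
        else:
            tags[idx] = tag  # miss: the line is fetched from DRAM
    return hits / len(addresses)

seq = list(range(4096))  # streaming (sequential) access pattern
print(f"sequential hit rate: {hit_rate(seq):.2%}")  # 93.75% (15/16)
```

    For a streaming pattern only one access per line goes to DRAM, so most accesses are served on chip; irregular but local patterns benefit similarly, which is exactly the case where manual scratchpad management is hardest.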

    Mix & Latch: An Optimization Flow for High-Performance Designs with Single-Clock Mixed-Polarity Latches and Flip-Flops

    Flip-flops are the most used sequential elements in synchronous circuits, but latch-based designs can operate at higher frequencies and occupy less area. Techniques to increase the maximum operating frequency of flip-flop-based designs, such as time borrowing, rely on tight hold constraints that are difficult to satisfy using traditional back-end optimization techniques. We propose Mix & Latch, a methodology to increase the operating frequency of synchronous digital circuits using a single clock tree and a mixed distribution of positive- and negative-edge-triggered flip-flops and positive- and negative-level-sensitive latches. An efficient mathematical model optimizes the type and location of the sequential elements of the circuit. The initial registers are not moved from their original locations, although they may change type, which allows the correctness of the transformation to be formally verified with equivalence checking and static timing analysis. The technique is validated on a 28 nm CMOS FDSOI technology, obtaining a 1.33X average post-layout operating frequency improvement over a standard commercial design flow on a broad set of benchmarks. The circuit area is also reduced by more than 1.19X on average for the same benchmarks, although area reduction is not a goal of the optimization algorithm. To the best of our knowledge, this is the first work that proposes combining mixed-polarity flip-flops and latches to improve circuit performance.
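
    The frequency benefit of level-sensitive latches can be illustrated with an idealized two-stage time-borrowing model; the stage delays are assumed values, and setup/hold margins and clock skew are ignored.

```python
# Why latches can beat flip-flops on unbalanced pipelines: a transparent
# latch lets a long stage "borrow" time from a short one. Simplified
# two-stage model; delays are illustrative, margins ignored.

d1, d2 = 1.4, 0.6  # combinational stage delays in ns (assumed)

t_ff = max(d1, d2)       # flip-flops: period set by the worst stage
t_latch = (d1 + d2) / 2  # ideal borrowing across a mid-pipeline latch
print(f"flip-flop period {t_ff} ns, latch-based period {t_latch} ns")
# 1.4 ns -> 1.0 ns here, i.e. 1.4x, in the same spirit as the ~1.33x
# average improvement reported above.
```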

    CNN-on-AWS: Efficient Allocation of Multi-Kernel Applications on Multi-FPGA Platforms

    Multi-FPGA platforms, like Amazon AWS F1, can run multi-kernel pipelined applications, such as Convolutional Neural Networks (CNNs), in the cloud with excellent performance and lower energy consumption than CPUs or GPUs. We propose a method to efficiently map these applications onto multi-FPGA platforms to maximize application throughput. For the given resources, our methodology finds the optimal number of parallel instances of each kernel in the pipeline and their allocation to one or more of the available FPGAs. We obtain this by formulating and solving a mixed-integer, non-linear optimization problem in which we model the performance of each component and the duration of the phases into which the accelerated computation can be split, namely: 1) data transfer from the host CPU to the DDR memory of each FPGA, 2) data transfer from FPGA DDR to FPGA on-chip memory, 3) kernel computation on the FPGA, 4) data transfer from FPGA on-chip memory to FPGA DDR, and 5) data transfer from FPGA DDR back to the host. Finding the optimal solution with a Mixed-Integer Non-Linear Programming (MINLP) solver is often highly inefficient, so we also provide a fast heuristic method that, according to our experiments, can be much more efficient than the MINLP solver while finding comparable results. For larger problems (more CNN layers), our heuristic method quickly finds (several thousand times faster) much better solutions than the MINLP solver, even when the latter runs for a very long time.
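
    A simplified version of the five-phase throughput model is sketched below; the per-phase times and the candidate allocation are illustrative assumptions, and only the compute phase is assumed to scale with the number of parallel instances.

```python
# Simplified throughput model over the five phases listed above; the
# per-phase times (ms) and the candidate allocation are assumptions.

kernels = [
    # (host->DDR, DDR->on-chip, compute, on-chip->DDR, DDR->host)
    (0.2, 0.1, 4.0, 0.1, 0.2),
    (0.1, 0.1, 2.0, 0.1, 0.1),
]
alloc = [2, 1]  # parallel compute units per kernel (candidate solution)

def stage_time(phases, units):
    """Stage time of one kernel; only compute scales with its instances."""
    h2d, d2c, comp, c2d, d2h = phases
    return h2d + d2c + comp / units + c2d + d2h

# A pipeline is paced by its slowest stage.
throughput = 1.0 / max(stage_time(k, u) for k, u in zip(kernels, alloc))
print(f"throughput: {throughput:.3f} batches/ms")
```

    The optimization problem searches over the integer allocation (here fixed to a candidate) to maximize this throughput subject to per-FPGA resource limits, which is what makes it mixed-integer and non-linear.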